Exploratory Analysis

So you come into work on a Monday morning and find that your boss has tasked you with finding some insights in a current data set. You open the Excel file and see countless rows and columns filled with data. What do you do next?

Well, because you are a trendy analyst or someone just looking to learn new skills, you decide to use Python.

In this short blog, I will teach you how to begin looking for insights in your data, in other words, exploratory analysis. We will use Python 3 and a few of its libraries to start our journey.

First off, you are going to import the Pandas library, Matplotlib library, and the Seaborn library.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

These are just a few of the libraries available in Python, and in our case they will be the main ones used for exploratory analysis.

Next, you will load your boss's Excel file, which you will have saved as a CSV file.

example_1 = pd.read_csv('~/Documents/projects/project_1/data/act_2018.csv')
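If you don't have the file handy and still want to follow along, here is a minimal sketch that loads a few rows from a string instead of a path. The rows are copied from the output shown in this post; the names sample_csv and sample are placeholders of mine, not part of the original notebook.

```python
import io

import pandas as pd

# Five rows copied from the post's printed output; this is NOT the full file.
sample_csv = """State,Participation,English,Math,Reading,Science,Composite
National,55,20.2,20.5,21.3,20.7,20.8
Alabama,100,18.9,18.3,19.6,19.0,19.1
Alaska,33,19.8,20.6,21.6,20.7,20.8
Arizona,66,18.2,19.4,19.5,19.2,19.2
Arkansas,100,19.1,18.9,19.7,19.4,19.4
"""

# read_csv accepts a file-like object as well as a file path.
sample = pd.read_csv(io.StringIO(sample_csv))

# Quick sanity check: number of rows and columns that were parsed.
print(sample.shape)
```

Checking .shape right after loading is a cheap way to confirm that the file was parsed into the number of rows and columns you expected.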

Now that you have loaded your file into the notebook, or text editor, you can begin the initial analysis with some Pandas functions.

Run the .head() method to show the first five rows along with the column names. Knowing what each column is called will help when you start plotting later.

example_1.head()
      State  Participation  English  Math  Reading  Science  Composite
0  National             55     20.2  20.5     21.3     20.7       20.8
1   Alabama            100     18.9  18.3     19.6     19.0       19.1
2    Alaska             33     19.8  20.6     21.6     20.7       20.8
3   Arizona             66     18.2  19.4     19.5     19.2       19.2
4  Arkansas            100     19.1  18.9     19.7     19.4       19.4

Next, run the .tail() method to show the last five rows of your DataFrame.

example_1.tail()
            State  Participation  English  Math  Reading  Science  Composite
47       Virginia             24     23.8  23.3     24.7     23.5       23.9
48     Washington             24     21.4  22.2     22.7     22.0       22.2
49  West Virginia             65     19.8  19.4     21.3     20.4       20.3
50      Wisconsin            100     19.8  20.3     20.6     20.8       20.5
51        Wyoming            100     19.0  19.7     20.6     20.3       20.0

Now that you have a basic understanding of your data, you should check whether any columns contain missing (null) values and what type of data each column holds. The .info() method reports both.

example_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
State            52 non-null object
Participation    52 non-null int64
English          52 non-null float64
Math             52 non-null float64
Reading          52 non-null float64
Science          52 non-null float64
Composite        52 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 3.0+ KB

In this data set, you can see that there are no null values: every column has 52 non-null entries out of 52. The State column is of type object (text), Participation is an integer, and the five score columns are floats.
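The .info() summary counts nulls but says nothing about zeros, which can also be placeholders for missing data. Here is a minimal sketch of checking both directly. It uses a few rows copied from the .head() output above as a stand-in for the full file, and the variable names are my own.

```python
import pandas as pd

# Stand-in data: five rows and three columns copied from the .head() output.
sample = pd.DataFrame({
    "State": ["National", "Alabama", "Alaska", "Arizona", "Arkansas"],
    "Participation": [55, 100, 33, 66, 100],
    "Composite": [20.8, 19.1, 20.8, 19.2, 19.4],
})

# Missing values per column.
null_counts = sample.isnull().sum()

# Literal zeros per numeric column (text columns are skipped).
zero_counts = (sample.select_dtypes("number") == 0).sum()

print(null_counts)
print(zero_counts)
print(sample.dtypes)
```

On your own data you would run the same three lines against example_1 instead of sample.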

Next, you can call the .describe() method and transpose the result with .T to see the min and max values of each numerical column, as well as its mean, standard deviation, and quartiles.

example_1.describe().T
               count       mean        std   min     25%    50%      75%    max
Participation   52.0  61.519231  33.757782   7.0  29.250  65.50  100.000  100.0
English         52.0  20.973077   2.424719  16.6  19.100  20.20   23.700   26.0
Math            52.0  21.113462   2.017573  17.8  19.400  20.65   23.125   25.2
Reading         52.0  22.001923   2.148186  18.0  20.475  21.45   24.050   26.1
Science         52.0  21.332692   1.853848  17.9  19.925  20.95   23.025   24.9
Composite       52.0  21.473077   2.087696  17.7  19.975  21.05   23.525   25.6

You can see that with just a few simple methods from the Pandas library, we have gained a solid understanding of our numerical data.
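The summary tells you what the min and max are, but not which rows they belong to. Here is a small sketch of looking that up with idxmin and idxmax. The data is a handful of rows copied from the outputs above, so the answers apply to this sample only, not to the full file.

```python
import pandas as pd

# Stand-in data: seven rows copied from the .head() and .tail() outputs.
sample = pd.DataFrame({
    "State": ["National", "Alabama", "Alaska", "Arizona",
              "Arkansas", "Virginia", "Washington"],
    "Composite": [20.8, 19.1, 20.8, 19.2, 19.4, 23.9, 22.2],
})

# idxmin/idxmax return the index label of the smallest/largest value,
# which .loc then uses to fetch the matching State.
lowest = sample.loc[sample["Composite"].idxmin(), "State"]
highest = sample.loc[sample["Composite"].idxmax(), "State"]

print("Lowest composite in the sample:", lowest)
print("Highest composite in the sample:", highest)
```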

And now, it's time to bring in the next library, Matplotlib, which gives us the ability to start plotting some data.

Let's get started by plotting each score column as a simple histogram. Maybe it will provide some insight into our data, like whether it is roughly normally distributed or not.

plt.figure(figsize=(4,4))
plt.hist(example_1['English']);

plt.figure(figsize=(4,4))
plt.hist(example_1['Math']);

plt.figure(figsize=(4,4))
plt.hist(example_1['Reading']);

plt.figure(figsize=(4,4))
plt.hist(example_1['Science']);

plt.figure(figsize=(4,4))
plt.hist(example_1['Composite']);

You can use Matplotlib to plot a few different styles of graphs quickly, but for exploratory analysis, I like to keep things simple with either histograms or scatterplots because they visualize the data efficiently.
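Since only histograms are shown above, here is a minimal scatterplot sketch for comparison. The choice of Participation against Composite is mine, and the data is ten rows copied from the .head() and .tail() outputs rather than the full file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a plain script; skip this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: ten rows copied from the .head() and .tail() outputs.
sample = pd.DataFrame({
    "Participation": [55, 100, 33, 66, 100, 24, 24, 65, 100, 100],
    "Composite": [20.8, 19.1, 20.8, 19.2, 19.4, 23.9, 22.2, 20.3, 20.5, 20.0],
})

# One point per row: participation on the x-axis, composite score on the y-axis.
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(sample["Participation"], sample["Composite"])
ax.set_xlabel("Participation")
ax.set_ylabel("Composite")
```

A scatterplot like this is a quick way to eyeball whether two columns move together before you compute any correlations.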

Remember, when doing exploratory analysis, you don't need to make things pretty; you need to make them easy for you to read and understand.
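In that spirit, the five separate histogram cells can be collapsed into a single loop that draws every score column side by side. This is only a sketch: the score_cols list and the one-row layout are my own choices, and the data is ten rows copied from the outputs above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a plain script; skip this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: ten rows copied from the .head() and .tail() outputs.
sample = pd.DataFrame({
    "English": [20.2, 18.9, 19.8, 18.2, 19.1, 23.8, 21.4, 19.8, 19.8, 19.0],
    "Math": [20.5, 18.3, 20.6, 19.4, 18.9, 23.3, 22.2, 19.4, 20.3, 19.7],
    "Reading": [21.3, 19.6, 21.6, 19.5, 19.7, 24.7, 22.7, 21.3, 20.6, 20.6],
    "Science": [20.7, 19.0, 20.7, 19.2, 19.4, 23.5, 22.0, 20.4, 20.8, 20.3],
    "Composite": [20.8, 19.1, 20.8, 19.2, 19.4, 23.9, 22.2, 20.3, 20.5, 20.0],
})

score_cols = ["English", "Math", "Reading", "Science", "Composite"]

# One row of subplots, one histogram per score column.
fig, axes = plt.subplots(1, len(score_cols), figsize=(4 * len(score_cols), 4))
for ax, col in zip(axes, score_cols):
    ax.hist(sample[col])
    ax.set_title(col)  # label each panel so the columns are easy to tell apart
fig.tight_layout()
```

Seeing all the distributions in one figure makes it easier to compare their shapes at a glance.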

The last visual I like to use is a heatmap; it shows the correlation among our numerical columns cleanly and effectively. This is where we use Seaborn.

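fig, ax = plt.subplots(figsize=(6,6))
# State holds text, so correlate only the numeric columns; newer pandas versions raise an error otherwise
sns.heatmap(example_1.select_dtypes('number').corr(), annot=True, cmap="icefire");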

You can see the value of the heatmap right away. The human eye interprets color coding quickly and effectively. Using the "icefire" colormap, you can see that Participation is negatively correlated with all of the scores provided by the ACT. That is a powerful insight that you might not have noticed without a heatmap.
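If you want the numbers behind the colors, you can pull one column out of the correlation matrix and sort it. Here is a minimal sketch that uses ten rows copied from the outputs above, so the exact values will differ from the full data set. The name participation_corr is mine.

```python
import pandas as pd

# Stand-in data: ten rows copied from the .head() and .tail() outputs.
sample = pd.DataFrame({
    "Participation": [55, 100, 33, 66, 100, 24, 24, 65, 100, 100],
    "Math": [20.5, 18.3, 20.6, 19.4, 18.9, 23.3, 22.2, 19.4, 20.3, 19.7],
    "Composite": [20.8, 19.1, 20.8, 19.2, 19.4, 23.9, 22.2, 20.3, 20.5, 20.0],
})

# Pairwise Pearson correlations between the numeric columns.
corr = sample.corr()

# Keep only Participation's correlations with the other columns,
# sorted from most negative to most positive.
participation_corr = corr["Participation"].drop("Participation").sort_values()

print(participation_corr)
```

A sorted column like this is handy when the heatmap has too many cells to scan comfortably.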

And that's it. In less than 20 minutes, whether or not you are a Python expert, you can explore your data and see what trends emerge. You can now report back to your boss on what you are seeing, or if you want to go further, check back soon for the next blog!